It supports the following data formats:
| Data format | Loading script | Example |
|---|---|---|
| CSV & TSV | `csv` | `load_dataset("csv", data_files="my_file.csv")` |
| Text files | `text` | `load_dataset("text", data_files="my_file.txt")` |
| JSON & JSON Lines | `json` | `load_dataset("json", data_files="my_file.jsonl")` |
| Pickled DataFrames | `pandas` | `load_dataset("pandas", data_files="my_dataframe.pkl")` |

(Table from the official Hugging Face documentation.)
As the table shows, `load_dataset()` takes the loading-script name plus a file path or URL via `data_files`. One point worth spelling out is how JSON differs from JSON Lines: a regular JSON file contains a single (possibly nested) JSON object, like the one below, whereas a JSON Lines (`.jsonl`) file contains one independent JSON object per line.
```json
{
  "user": {
    "id": 1,
    "name": "John Doe",
    "email": "john.doe@example.com",
    "isStudent": true,
    "courses": [
      {
        "id": 101,
        "title": "Introduction to Programming",
        "instructor": "Jane Smith"
      },
      {
        "id": 102,
        "title": "Data Structures and Algorithms",
        "instructor": "Tom Brown"
      }
    ]
  }
}
```
A JSON Lines file, by contrast, holds one complete JSON object per line:

```json
{"id": "2834", "tokens": ["星", "巴", "克", "小", "圓", "零", "錢", "包"], "ner_tags": ["B-BRAND", "I-BRAND", "I-BRAND", "O", "O", "B-ITEM", "I-ITEM", "I-ITEM"]}
{"id": "4516", "tokens": ["e", "x", "c", "e", "l", " ", "漸", "層", "魅", "色", "腮", "紅"], "ner_tags": ["B-BRAND", "I-BRAND", "I-BRAND", "I-BRAND", "I-BRAND", "O", "O", "O", "O", "O", "B-ITEM", "I-ITEM"]}
{"id": "8103", "tokens": ["m", "e", "k", "o", "魔", "翹", "美", "型", "纖", "長", "睫", "毛", "膏"], "ner_tags": ["B-BRAND", "I-BRAND", "I-BRAND", "I-BRAND", "O", "O", "O", "O", "O", "O", "B-ITEM", "I-ITEM", "I-ITEM"]}
```
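The structural difference shows up directly in how the two formats are parsed. A minimal standard-library sketch (the sample strings below are made up for illustration): a regular JSON file is parsed in a single call, while a JSON Lines file is parsed line by line.

```python
import json

# A regular JSON file: one (possibly nested) object, parsed in a single call.
regular_json = '{"user": {"id": 1, "courses": [{"id": 101}, {"id": 102}]}}'
doc = json.loads(regular_json)
print(len(doc["user"]["courses"]))  # 2

# A JSON Lines file: one independent object per line, parsed line by line.
json_lines = (
    '{"id": "2834", "tokens": ["星", "巴", "克"]}\n'
    '{"id": "4516", "tokens": ["e", "x", "c", "e", "l"]}\n'
)
records = [json.loads(line) for line in json_lines.splitlines()]
print([r["id"] for r in records])  # ['2834', '4516']
```

This is why the `json` loading script can treat each line of a `.jsonl` file as one example, while a nested JSON file may need extra guidance (see the `field` argument below).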
```python
from datasets import load_dataset

# Use the raw-file URL; a github.com/.../blob/... URL returns an HTML page, not the text itself
dataset_url = "https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt"
text_dataset = load_dataset("text", data_files=dataset_url)
print(text_dataset["train"][:5])
```
Here we use the `text` loading script and simply pass the remote file's URL to `load_dataset()`. Each line of the file becomes one example, so the first five examples look like this:

```
{
    'text': ['First Citizen:',
    'Before we proceed any further, hear me speak.',
    '',
    'All:',
    'Speak, speak.']
}
```
The same approach works for JSON. The SQuAD-it training set is hosted as a gzipped JSON file, which `load_dataset()` downloads and decompresses on the fly:

```python
from datasets import load_dataset

dataset_url = "https://github.com/crux82/squad-it/raw/master/SQuAD_it-train.json.gz"
squad_it_dataset = load_dataset("json", data_files=dataset_url, field="data")
print(squad_it_dataset)
```
```
DatasetDict({
    train: Dataset({
        features: ['title', 'paragraphs'],
        num_rows: 442
    })
})
```
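The `field="data"` argument tells the `json` script where the examples live inside the file: SQuAD-style files wrap everything under a top-level `"data"` key. Conceptually it amounts to the following (a simplified standard-library sketch with a made-up miniature payload, not the library's actual code):

```python
import json

# Miniature SQuAD-style payload: the examples are nested under a top-level "data" key.
raw = json.loads(
    '{"version": "1.1", "data": ['
    '{"title": "Terremoto del Sannio del 1688", "paragraphs": []},'
    '{"title": "Allan Dwan", "paragraphs": []}]}'
)

# field="data" means: treat raw["data"] as the list of rows and ignore the rest.
rows = raw["data"]
print(len(rows))         # 2
print(rows[0]["title"])  # Terremoto del Sannio del 1688
```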
Loading a local file works the same way; here we load the test set:

```python
from datasets import load_dataset

squad_it_dataset = load_dataset("json", data_files="SQuAD_it-test.json", field="data")
print(squad_it_dataset)
```
```
DatasetDict({
    train: Dataset({
        features: ['title', 'paragraphs'],
        num_rows: 48
    })
})
```
Often we want the train and test splits in one place, e.g. combining SQuAD_it-train.json and SQuAD_it-test.json into a single `DatasetDict` object, so that we can later use `Dataset.map()` to process the training and test sets at the same time. To do this, we pass the `data_files` argument a dictionary that maps each split name to the file associated with that split:
```python
from datasets import load_dataset

data_files = {"train": "SQuAD_it-train.json", "test": "SQuAD_it-test.json"}
squad_it_dataset = load_dataset("json", data_files=data_files, field="data")
print(squad_it_dataset)
```
```
DatasetDict({
    train: Dataset({
        features: ['title', 'paragraphs'],
        num_rows: 442
    })
    test: Dataset({
        features: ['title', 'paragraphs'],
        num_rows: 48
    })
})
```
This is exactly what we need. From here we can apply all sorts of preprocessing: cleaning the data, tokenizing the reviews, and so on.
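Conceptually, calling `map()` on a `DatasetDict` just applies the same function to every example in every split. A toy sketch of that idea, using plain dicts of lists rather than the real `datasets` objects:

```python
# Toy stand-in for a DatasetDict: split name -> list of examples.
dataset_dict = {
    "train": [{"title": "a "}, {"title": "b "}],
    "test": [{"title": "c "}],
}

def clean(example):
    # Example preprocessing step: strip whitespace from a field.
    return {"title": example["title"].strip()}

# Apply the function to every example in every split at once.
processed = {split: [clean(ex) for ex in rows] for split, rows in dataset_dict.items()}
print(processed["train"])  # [{'title': 'a'}, {'title': 'b'}]
print(processed["test"])   # [{'title': 'c'}]
```

With a real `DatasetDict`, `squad_it_dataset.map(clean)` would play the role of the dictionary comprehension above.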
If the dataset also has a validation split, just add another entry to the mapping:

```python
data_files = {
    "train": "train.json",
    "test": "test.json",
    "validation": "validation.json",
}
```
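As far as I know, the values in the `data_files` mapping can also be lists of files (and, in recent versions of the `datasets` library, glob patterns), which is handy when a split is sharded across several files. The file names below are hypothetical:

```python
data_files = {
    # A split can map to a single file or to a list of file shards (hypothetical names).
    "train": ["train_part1.json", "train_part2.json"],
    "test": "test.json",
}
```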